A Comparative Study on Outlier Removal from a Large-scale Dataset using Unsupervised Anomaly Detection

Authors

  • Markus Goldstein
  • Seiichi Uchida
Abstract

Outlier removal from training data is a classical problem in pattern recognition. This problem has become more important for large-scale datasets for two reasons. First, there is a higher risk of "unexpected" outliers, such as mislabeled training data. Second, a large-scale dataset makes it more difficult to grasp the distribution of outliers. At the same time, many unsupervised anomaly detection methods have been proposed, and these can also be used for outlier removal. In this paper, we present a comparative study of nine different anomaly detection methods in the scenario of outlier removal from a large-scale dataset. To observe performance accurately, we need a simple and describable recognition procedure, and we therefore use a nearest neighbor-based classifier. As an adequately large dataset, we prepared a handwritten digit dataset comprising more than 800,000 manually labeled samples. With a data dimensionality of 16×16 = 256, each digit class is guaranteed to have at least 100 times as many instances as the data dimensionality. The experimental results show that the common understanding that outlier removal improves classification performance, which holds for small datasets, does not hold for high-dimensional large-scale datasets. Additionally, local anomaly detection algorithms were found to perform better on this data than their global equivalents.
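
The pipeline described in the abstract (score the training samples with an unsupervised anomaly detector, drop the highest-scoring ones, then classify with a nearest neighbor rule) can be illustrated with a minimal sketch. It is not the authors' implementation: the global k-NN score used here (mean distance to the k nearest neighbors) is only one possible detector, and scoring each class separately, the toy 2-D data, and the names `knn_scores`, `remove_outliers`, and `classify_1nn` are all assumptions made for illustration.

```python
import math

def knn_scores(points, k):
    """Global k-NN anomaly score: mean distance to the k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def remove_outliers(points, labels, k=2, drop_per_class=1):
    """Drop the highest-scoring samples of each class (assumed per-class scoring)."""
    keep = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        scores = knn_scores([points[i] for i in idx], k)
        ranked = sorted(zip(scores, idx))  # ascending anomaly score
        keep.extend(i for _, i in ranked[:len(idx) - drop_per_class])
    keep.sort()
    return [points[i] for i in keep], [labels[i] for i in keep]

def classify_1nn(train_x, train_y, x):
    """Plain 1-nearest-neighbor classification."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]

# Toy data: two clusters, plus one sample of the second cluster mislabeled as 0.
class0 = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.3, 0.3), (0.7, 0.7)]
class1 = [(5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5),
          (5.2, 5.8), (5.8, 5.2), (5.3, 5.3), (5.7, 5.7), (5.4, 5.6)]
mislabeled = [(5.0, 5.2)]  # lies in the class-1 cluster but carries label 0

train_x = class0 + mislabeled + class1
train_y = [0] * (len(class0) + len(mislabeled)) + [1] * len(class1)

query = (5.05, 5.2)
before = classify_1nn(train_x, train_y, query)  # misled by the mislabeled sample
clean_x, clean_y = remove_outliers(train_x, train_y)
after = classify_1nn(clean_x, clean_y, query)
print(before, after, len(clean_x))
```

In this toy setting the removal step helps because the mislabeled sample sits right next to the query. The sketch also shows the cost side: one legitimate class-1 sample is discarded along with the mislabeled one, which is the kind of trade-off the study evaluates at scale.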

Similar resources

Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm

Unsupervised anomaly detection is the process of finding outliers in data sets without prior training. In this paper, a histogram-based outlier detection (HBOS) algorithm is presented, which scores records in linear time. It assumes independence of the features, making it much faster than multivariate approaches at the cost of less precision. A comparative evaluation on three UCI data sets and 10 s...

A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.

Anomaly detection is the process of identifying unexpected items or events in datasets, which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for exampl...

Outlier Detection Using K-Mean and Hybrid Distance Technique on Multi-Dimensional Data Set

Outlier detection is a major issue in data mining. Outliers are contaminants that deviate from the other objects. Outlier detection is used to make data more informative and easier to understand. Many types of databases are in use nowadays, and many of them contain anomalous objects; detecting or removing these objects is known as outlier detection. In the proposed work outliers are dete...

Learning Representations for Outlier Detection on a Budget

The problem of detecting a small number of outliers in a large dataset is an important task in many fields from fraud detection to high-energy physics. Two approaches have emerged to tackle this problem: unsupervised and supervised. Supervised approaches require a sufficient amount of labeled data and are challenged by novel types of outliers and inherent class imbalance, whereas unsupervised m...

Effective Outlier Detection using K-Nearest Neighbor Data Distributions: Unsupervised Exploratory Mining of Non-Stationarity in Data Streams

We describe approaches and preliminary experiments that are aimed at monitoring and detecting change in self-monitored data streams. We introduce a new algorithm for outlier detection using K-Nearest Neighbor Data Distributions. We run experiments on a variety of data stream topologies and thereby demonstrate the effectiveness of the new algorithm in detecting outliers and in quantitatively est...

Publication date: 2016